Skip to content

[Dist][Release] Use one wheel for different CUDA version#1826

Open
oraluben wants to merge 20 commits intotile-ai:mainfrom
oraluben:update-nvruntime-stubs
Open

[Dist][Release] Use one wheel for different CUDA version#1826
oraluben wants to merge 20 commits intotile-ai:mainfrom
oraluben:update-nvruntime-stubs

Conversation

@oraluben
Copy link
Collaborator

@oraluben oraluben commented Feb 9, 2026

Followup of #1821
Validate that wheels built with CUDA 12.8 works with CUDA 11.8 (torch 2.4) and CUDA 13 (torch 2.10)

After this PR, libcudart from host toolchain is not loaded.

for libnvrtc, both host lib and pip lib are loaded by torch. this behavior stays same from torch 2.4 to 2.10.

Summary by CodeRabbit

  • Chores

    • Simplified CI by removing nightly CUDA 13.0 branches and related nightly-specific handling; unified Python/version mapping and build/test flows to use a single version variable.
  • Refactor

    • Streamlined CUDA runtime and NVRTC probing to prefer globally available symbols, reduced initialization complexity, and fail fast with clearer error messages when required symbols are absent.
    • Adjusted stub initialization to proceed to symbol validation rather than early-returning.
  • Documentation

    • Added detailed runtime internals docs and motivation notes explaining the CUDA stub design, loading strategy, and build-time toggle.

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Removed nightly-specific CI branches and simplified CUDA/NVRTC/CUDART stub loading to probe global symbols first via dlsym(RTLD_DEFAULT) then dlsym(RTLD_NEXT), emitting an error and aborting if target symbols are absent.

Changes

Cohort / File(s) Summary
CI/CD Workflow
\.github/workflows/dist.yml
Removed Nightly-CUDA-13.0 entries and nightly-specific branches; dropped Python 3.9 test-version mapping; switched setup-uv to use matrix.python-version; removed UV_INDEX/nightly Torch-install logic.
CUDA Runtime & NVRTC stubs
src/target/stubs/cudart.cc, src/target/stubs/nvrtc.cc
Replaced multi-SONAME/dlopen probing and RTLD_NOLOAD reuse with global-symbol probes using dlsym(RTLD_DEFAULT) then dlsym(RTLD_NEXT) (checks cudaGetErrorString / nvrtcVersion); added _GNU_SOURCE, stdio.h, stdlib.h; emit standardized error and abort() if symbols missing; removed path-based dlopen loops.
Docs & Header
docs/runtime_internals/stubs.md, src/target/stubs/cuda.h
Added detailed documentation of stub mechanism, loading strategy, error semantics, and build-time toggle; expanded header motivation text for lazy-loading CUDA stubs.

Sequence Diagram(s)

sequenceDiagram
  participant App as "App / Stub"
  participant Linker as "Dynamic Linker\n(RTLD_DEFAULT / RTLD_NEXT)"
  participant Lib as "CUDA lib\n(if loaded)"

  App->>Linker: dlsym(RTLD_DEFAULT, "symbol")
  alt found in RTLD_DEFAULT
    Linker-->>App: symbol pointer
    App->>Lib: call via global symbol (no dlopen)
  else not found
    App->>Linker: dlsym(RTLD_NEXT, "symbol")
    alt found in RTLD_NEXT
      Linker-->>App: symbol pointer
      App->>Lib: call via next-symbol resolver
    else not found
      Linker-->>App: NULL
      App->>App: print error and abort()
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, dependencies

Suggested reviewers

  • XuehaiPan
  • LeiWang1999

Poem

🐰 I hopped through symbols, keen and bright,
I sniffed DEFAULT first in the night.
If NEXT replied, I twitched my nose,
If none were found, I thumped my woes.
A tidy hop, and off I go — goodnight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Dist][Release] Use one wheel for different CUDA version' directly addresses the main objective: enabling a single wheel to work across multiple CUDA versions (11.8, 12.8, 13) by refactoring the stub loading mechanism.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@XuehaiPan XuehaiPan marked this pull request as draft February 9, 2026 09:51
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/dist.yml (1)

117-126: ⚠️ Potential issue | 🟡 Minor

Wheels only validated with basic importability — cross-CUDA runtime testing missing.

The dist.yml workflow builds wheels with CUDA 12.8 (lines 118–120) and only tests basic importability: import tilelang (line 203). If wheels built with CUDA 12.8 are claimed to work with CUDA 11.8 and CUDA 13, no CI workflow validates this. The ci.yml workflow also tests only CUDA 12.8, with no matrix entries for other CUDA versions.

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 64-75: TryLoadLibCudart currently returns RTLD_DEFAULT which can
resolve to the stub's own exported cudaGetErrorString causing infinite recursion
and also fails to dlopen libcudart from disk; update TryLoadLibCudart to first
probe with dlsym(RTLD_NEXT, "cudaGetErrorString") (or check a sentinel boolean
that marks "in-stub" symbols to avoid self-resolution) and if that fails attempt
to dlopen common libcudart filenames (e.g., "libcudart.so", versioned names)
with RTLD_NOW|RTLD_LOCAL and return that handle; ensure CreateCUDARuntimeAPI
still uses the returned handle (not the stub) so function pointers point to real
libcudart symbols and not the stub to avoid recursion.

In `@src/target/stubs/nvrtc.cc`:
- Around line 40-50: Replace the fragile dlsym(RTLD_DEFAULT, "nvrtcVersion")
check in TryLoadLibNvrtc with a robust load/check sequence: avoid using
RTLD_DEFAULT (which can return the stub's own nvrtcVersion) and instead first
try dlopen with RTLD_NOLOAD on likely lib names to detect an already-loaded
libnvrtc, then if not found attempt dlopen of the library file (e.g.,
"libnvrtc.so" variants) to load it from disk; use the returned dlopen handle and
dlsym(handle, "nvrtcVersion") to confirm the symbol, and return that handle (or
nullptr on failure) — update TryLoadLibNvrtc and all callers to use the dlopen
handle rather than RTLD_DEFAULT to prevent self-resolution and allow standalone
loading.
🧹 Nitpick comments (1)
.github/workflows/dist.yml (1)

187-208: Wheel test only checks importability — no torch/CUDA runtime validation.

The test step installs the wheel and runs import tilelang, but doesn't install or test with any PyTorch version. Given the new stub design relies on PyTorch having loaded libcudart/libnvrtc into the global namespace, this test won't exercise the dynamic loading path at all. Consider adding at least a smoke test that installs torch and verifies CUDA stub resolution succeeds.

@oraluben oraluben marked this pull request as ready for review February 9, 2026 09:55
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 76-79: The current error path uses __builtin_unreachable() which
is only an optimization hint and doesn't guarantee termination; update the error
handling after the fprintf in the TileLang libcudart check to call abort() (from
stdlib.h) before the __builtin_unreachable() (or replace the builtin call with
abort()) so the process deterministically terminates and the error message is
visible; target the fprintf(...) statement and the following
__builtin_unreachable() call in cudart.cc when making this change.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/target/stubs/cudart.cc (1)

14-16: ⚠️ Potential issue | 🟡 Minor

Stale doc comment — dlopen() is no longer used.

Lines 15–16 say "The real libcudart is loaded lazily via dlopen() on first API call, and symbols are resolved via dlsym()." After this PR, dlopen is removed entirely; symbols are found via dlsym(RTLD_DEFAULT/RTLD_NEXT, …). Update the header to reflect the new strategy.

src/target/stubs/nvrtc.cc (1)

13-14: ⚠️ Potential issue | 🟡 Minor

Stale doc comment — dlopen() is no longer used.

Lines 13–14 reference lazy loading via dlopen(), but this PR replaces that with dlsym(RTLD_DEFAULT/RTLD_NEXT, …). Update accordingly.

🤖 Fix all issues with AI agents
In `@src/target/stubs/cudart.cc`:
- Around line 66-85: TryLoadLibCudart can return RTLD_DEFAULT (which is
((void*)0) on glibc) or RTLD_NEXT, so the nullptr check in CreateCUDARuntimeAPI
is invalid and causes a spurious "missing library" path; modify the code so
CreateCUDARuntimeAPI does not treat a returned RTLD_DEFAULT as failure — either
remove the handle==nullptr guard entirely or replace it with an explicit check
that treats RTLD_DEFAULT and RTLD_NEXT as valid sentinel handles (e.g., compare
against RTLD_DEFAULT/RTLD_NEXT or use a dedicated bool flag returned by
TryLoadLibCudart); update the logic around TryLoadLibCudart,
CreateCUDARuntimeAPI, and any callers of CUDARuntimeAPI/MissingLibraryError to
rely on that explicit sentinel instead of nullptr.

In `@src/target/stubs/nvrtc.cc`:
- Around line 42-60: The check in CreateNVRTCAPI that treats a nullptr handle as
a failure must be removed because TryLoadLibNvrtc() aborts on error and may
legitimately return RTLD_DEFAULT which is (void*)0 on glibc; update
CreateNVRTCAPI to accept the handle returned by TryLoadLibNvrtc() (including
RTLD_DEFAULT) and proceed to load nvrtc symbols (references: TryLoadLibNvrtc,
CreateNVRTCAPI, NVRTCAPI, nvrtcVersion, RTLD_DEFAULT, RTLD_NEXT) instead of
returning an empty NVRTCAPI when handle == nullptr.

@oraluben oraluben changed the title [Dist] Use one wheel for different CUDA version [Dist][Release] Use one wheel for different CUDA version Feb 9, 2026
@oraluben oraluben marked this pull request as draft February 9, 2026 15:30
oraluben and others added 2 commits February 9, 2026 23:32
…. This change simplifies the API initialization process in both cudart.cc and nvrtc.cc, ensuring that the function pointer lookups proceed without early returns.
@oraluben oraluben marked this pull request as ready for review February 10, 2026 01:50
…zy loading mechanism. Update documentation to clarify the purpose and implementation of stubs for better understanding.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants